
Conversation

@yhmtsai (Member) commented Jul 21, 2025

This PR adds a logger event for synchronization.
I put the event in the executor_event collection and tagged it as an operation in profiler_hook.
To distinguish the additional synchronization performed by profiler_hook from the synchronization inside the profiled function, I allow the profiler hook to call synchronize_impl directly, so that its calls do not emit the usual synchronize event.
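A minimal sketch of that split; only the synchronize/synchronize_impl names come from this PR, the event-emission helpers are assumed for illustration:

```cpp
// Sketch: the public synchronize() emits the logger event, while the
// profiler hook's own bookkeeping calls synchronize_impl() directly
// and therefore produces no synchronize event of its own.
class Executor {
public:
    void synchronize() const
    {
        begin_operation("synchronize");  // hypothetical event emission
        synchronize_impl();
        end_operation("synchronize");
    }

protected:
    friend class profiler_hook;  // may call synchronize_impl() directly

    virtual void synchronize_impl() const = 0;
    virtual void begin_operation(const char* name) const = 0;
    virtual void end_operation(const char* name) const = 0;
};
```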

As an aside, this PR is not yet enough to distinguish the real copy time from the time spent waiting for previous GPU activity, because the destination of the residual_norm copy is pageable memory unless we are on Grace Hopper or use cudaMallocHost for the result memory; that will be handled in another PR. According to https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async, the copy might be synchronous with respect to the host.
In practice, the current behavior synchronizes during the cudaMemcpyAsync, so we actually have two synchronizations: the copy and the explicit sync. Unfortunately, we cannot eliminate the explicit sync, because CUDA only states that the copy "might" synchronize.
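The double synchronization looks roughly like this (a sketch with hypothetical names; cudaMemcpyAsync and cudaStreamSynchronize are real CUDA runtime calls):

```cpp
#include <cuda_runtime.h>

// Copying a scalar from device to pageable host memory: the async copy
// "might" already synchronize with the host (sync #1), but since the
// documentation only says "might", the explicit wait (sync #2) must stay.
double copy_norm_to_host(const double* device_norm, cudaStream_t stream)
{
    double host_norm = 0.0;  // pageable stack memory, not cudaMallocHost'd
    cudaMemcpyAsync(&host_norm, device_norm, sizeof(double),
                    cudaMemcpyDeviceToHost, stream);  // possible sync #1
    cudaStreamSynchronize(stream);                    // explicit sync #2
    return host_norm;
}
```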

@yhmtsai requested a review from a team July 21, 2025 08:39
@yhmtsai self-assigned this Jul 21, 2025
@yhmtsai added the 1:ST:ready-for-review label (This PR is ready for review) Jul 21, 2025
@ginkgo-bot added the mod:core (core module), mod:cuda (CUDA module), mod:hip (HIP module), and mod:dpcpp (DPC++ module) labels Jul 21, 2025
@upsj (Member) commented Jul 21, 2025

This looks like it is only useful for profiler_hook loggers that don't distinguish between CPU and GPU timing, i.e. only the summary and nested_summary loggers, right? I'm not sure yet whether that is enough of a reason to add a separate logger event. If we use device timers, we get correct device timings at the high level; if we use host timers, we get correct host timings; and if we want mixed information, a purpose-built tracing tool like NSight Systems or rocProfiler gives more reliable results.

Additionally, this would add noise to the profiler output for executors where synchronize is a no-op.
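For reference, a hedged sketch of the device-timer/host-timer distinction, using raw CUDA events rather than Ginkgo's logger API; launch_work is a hypothetical stand-in for the profiled operation:

```cpp
#include <chrono>
#include <cuda_runtime.h>

void launch_work(cudaStream_t stream);  // hypothetical profiled operation

// Device timing: elapsed time between two events recorded on the stream,
// i.e. the time the GPU actually spent on the work.
float device_time_ms(cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    launch_work(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Host timing: CPU wall time, which also includes any time the host
// spends blocked waiting for the device to finish.
double host_time_ms(cudaStream_t stream)
{
    auto begin = std::chrono::steady_clock::now();
    launch_work(stream);
    cudaStreamSynchronize(stream);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```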

@yhmtsai (Member, Author) commented Jul 21, 2025

I am not a big fan of providing this when we can use the vendor profiler directly, but it should not give misleading information if we do provide it.
On the latter point, we may run benchmarks with two loggers, one for GPU and one for CPU, and then it helps to know that the CPU time is spent in synchronization rather than in the actual function, although that requires a different host allocator from #1897.

@upsj (Member) commented Jul 21, 2025

Let me be a bit more precise: the noise is the biggest issue for me. If the executor doesn't do anything when synchronizing, it also shouldn't show up in the output.

Looking at where we actually call synchronize explicitly in Ginkgo, it all seems related to distributed use cases or copies (there are a few other places, but I think they are unnecessary). For that purpose, it may make sense to spend some more time thinking about asynchronicity and futures, since that is essentially what we need here to avoid the duplicate synchronization. With explicit device-related futures, we can provide a function that waits for multiple streams (and removes duplicate waits if some of them are the same stream), which would then allow us to add asynchronous copy functionality to all executors.

To be clear about terminology: I don't mean using multiple threads, I mean using a single thread to poll multiple futures, as in the sketch below. If we get this right, we can also cleanly extend it to a multi-stream setup, which would allow even more overlap of operations.
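A hedged sketch of that single-threaded polling idea (hypothetical helper; one recorded event stands in for each stream's future, and duplicate streams could share an event):

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Poll several device "futures" from one host thread: repeatedly query
// each event and mark it done once cudaEventQuery reports completion,
// never blocking on any single stream. A real implementation would
// likely add error handling and back off instead of busy-waiting.
void wait_all(const std::vector<cudaEvent_t>& events)
{
    std::vector<bool> done(events.size(), false);
    std::size_t remaining = events.size();
    while (remaining > 0) {
        for (std::size_t i = 0; i < events.size(); ++i) {
            if (!done[i] && cudaEventQuery(events[i]) == cudaSuccess) {
                done[i] = true;
                --remaining;
            }
        }
    }
}
```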

@yhmtsai (Member, Author) commented Jul 21, 2025

Asynchronicity is a different topic.
We call synchronize, so it should appear in the log; if it is noise because it is a no-op, we should avoid the no-op call, not avoid logging it.
If you mean the duplicated wait from the copy itself plus the explicit wait in the Ginkgo copy: we cannot avoid the duplicate synchronization, because the documentation only says the copy might synchronize, so it might not, even when the memory is pageable. Also, we cannot force users to use cudaMallocHost to enable a truly asynchronous copy here.
If you mean the duplicated waits from many copies, that is indeed a slowdown factor when we issue several copies at the same time, and it should be avoided.
However, you still need to wait when you read the value and need it to be the updated one.
